ABSTRACT


This  thesis  describes  the  design  methodology  and  the  process  of  employing 
the  GENESIL  Silicon  Compiler  (GSC)  (Version  7.1)  in  the  layout  of  a  pipelined 
multiplier,  in  1.5  micron  CMOS  technology,  using  a  parallel  multiplier  cell 
array.  Additionally,  background  material  on  the  GSC,  the  theory  of 
multiplication,  as  well  as  the  concept  and  theory  of  pipelining  are  presented. 

The  results  revealed  two  practical  limits  of  the  GSC  system  which  precluded 
achieving  the  high  component  density  made  possible  by  full  custom,  "manual" 
CAD  methods  using  graphic  layout  tools.  Although  the  GSC  system  did  not 
perform  as  desired  in  this  study,  it  offers  a  viable  alternative  to  the  labor- 
intensive,  full  custom,  VLSI  graphic  layout  tools  in  use  today. 
I. INTRODUCTION


A.  BACKGROUND 

Multiplication  is  often  an  essential  function  in  many  digital  systems.  For 
example,  a  multiplier  is  a  necessary  part  of  any  digital  signal  processing  circuit 
[Ref.  1].  In  many  signal  processing  operations,  such  as  correlation,  convolution, 
filtering,  and  frequency  analysis,  one  needs  to  perform  multiplication  [Ref.  2], 
and,  in  order  to  perform  real-time  signal  processing,  a  high-speed  multiplier  is 
required [Ref. 3]. Additionally, in the majority of digital signal processing
applications  the  critical  processing  paths  usually  involve  many  multiplications 
[Ref.  4].  Clearly,  fast  digital  multipliers  are  one  of  the  most  important  building 
blocks  in  Very  Large  Scale  Integration  (VLSI)  chips  for  advanced  digital  signal 
processing. 

In  high-performance  systems,  many  of  the  above  operations  are  implemented 
with  bipolar  device  technology,  which  consumes  a  significant  amount  of  direct 
current  (DC)  power.  On  the  other  hand,  Complementary  Metal  Oxide 
Semiconductor  (CMOS)  technology  can  substantially  reduce  the  power 
consumption,  but  results  in  much  slower  device  speed. 

CMOS  is  a  combination  of  P-channel  and  N-channel  enhancement  metal 
oxide  semiconductor  field  effect  transistors  (MOSFETs)  used  in  a 
complementary  circuit  arrangement  that  is  useful  in  digital  logic  circuitry. 
Among  its  advantages  are  that  it  has  extremely  low  power  dissipation,  requires 
only  one  DC  power  supply,  operates  over  a  wide  range  of  supply  voltages,  and 
can  drive  as  many  as  50  gate-inputs  [Ref.  5].  The  fabrication  of  a  CMOS  IC 
(integrated  circuit)  requires  a  "prescription"  for  preparing  the  photomasks  that 
will be used in the manufacturing process. This "prescription" is a set of rules
which  provides  a  link  between  the  circuit  designer  and  process  engineer  during 
the  manufacturing  phase.  The  rules  are  often  referred  to  as  layout  rules  or  as 
design  rules.  The  main  objective  of  the  layout  rules  is  to  make  a  circuit  with 
optimum  yield  in  as  small  an  area  (geometry)  as  possible  without  jeopardizing 
the  reliability  of  the  circuit  [Ref.  2].  There  are  several  ways  to  describe  the 
design  rules.  One  way  is  by  the  "micron"  rules  which  are  stated  as  some  micron 
resolution.  Micron  design  rules  are  usually  given  as  a  list  of  minimum  feature 
sizes  and  spacing  required  for  all  the  masks  in  a  given  fabrication  process  [Ref. 
2].  Hence,  as  indicated  in  the  abstract  of  this  report,  the  multipliers  designed  in 
this  thesis  have  a  minimum  feature  size  of  1.5  microns  in  CMOS  technology.  By 
incorporating  pipelining  into  the  design,  the  throughput  of  a  large  CMOS  circuit 
can  be  improved  significantly  [Ref.  4].  For  example,  the  results  of  a  study  by 
Hallin  and  Flynn  [Ref.  6]  indicated  that  pipelining  can  give  a  40  percent  increase 
in  adder  efficiency  and  a  230  percent  increase  in  multiplier  throughput. 

With  the  advent  of  high-speed  semiconductor  memory,  an  increasing 
mismatch  between  memory  access  and  multiplication  time  has  arisen. 
Consequently,  there  is  considerable  interest  in  parallel  array  multipliers  [Ref.  7]. 
An  array  multiplier  and  a  multiplier  using  a  Wallace  tree  are  well-known  for 
their  high-speed  multiplication  [Ref.  3].  The  previous  study  by  Hallin  and  Flynn 
[Ref.  6]  also  demonstrated  that  the  most  efficient  multiplier  is  a  maximally 
pipelined  tree  multiplier  which  was  shown  to  be  50  percent  more  efficient  than 
the  array  multiplier.  However,  because  unit  cells  in  the  array  multiplier  are  used 
repeatedly  its  layout  is  highly  modular.  Modularity  makes  the  array  multiplier 
more  favorable  than  a  tree  multiplier  for  VLSI  implementation.  Therefore, 
many  MOS  multipliers  have  been  fabricated  using  this  method  [Ref.  3]. 
As ICs grow increasingly more complex, it becomes necessary to develop
new  methods  to  manage  the  design  complexities,  as  well  as  the  expenses 
associated  with  the  design  and  testing  of  the  IC.  Also,  from  this  increase  in  IC 
complexity  arises  the  demand  for  faster  and  more  economical  methods  to 
streamline  the  design  process.  One  state-of-the-art  solution  to  meet  this  demand 
is  the  silicon  compiler.  A  silicon  compiler  is  a  computer  system  which  generates 
IC  layouts  from  high-level  descriptions.  The  advantage  that  a  silicon-compiler- 
based  process  has  over  a  custom  IC  system  design  process  is  that  the  latter 
requires  a  team  of  experts  in  the  fields  of  logic  implementation,  circuit 
simulation,  chip  layout,  and  testing.  However,  the  design  process  based  on  the 
silicon  compiler  may  be  accomplished  by  one  individual  utilizing  a  top-down, 
hierarchical  design  methodology  beginning  with  a  partitioned  chip  set, 
progressing  downward  into  individual  chips  and  modules,  and  terminating  at  the 
block level. Far less time is required to design an IC using a silicon
compiler  than  for  a  full  custom,  "manual"  CAD  method  using  graphic  layout 
tools.  Thus,  one  can  see  that  the  silicon  compiler  provides  a  streamlined  method 
for  rapid  development  of  IC  systems  [Ref.  8].  The  disadvantages  of  the  silicon 
compiler  are  that  the  resulting  circuit  is  often  slower  and  the  layout  is  not  always 
efficient  in  its  use  of  area. 

B.  THESIS  GOALS 

The  motivation  for  this  thesis  was  to  learn  more  about  digital  multipliers,  as 
well  as  to  work  with  state-of-the-art  VLSI  circuit  design  tools.  The  main  goal  of 
this  thesis  was  to  design  a  pipelined  multiplier  using  the  GENESIL  Silicon 
Compiler.  Concomitant  with  this  goal  was  the  desire  to  learn  more  about  the 
concept  and  theory  of  pipelining.  An  emphasis  has  been  placed  on  documenting 
the thought processes that went into the multiplier designs in this thesis, as well
as  the  problems  encountered  along  the  way.  Additionally,  it  was  a  goal  to  fully 
explore  and  probe  the  GENESIL  Silicon  Compiler  to  determine  its  practical 
limits  in  parallel  multiplier  array  design.  Finally,  there  was  an  attempt  to 
produce  a  document  that  could  be  understood  by  one  not  well  versed  in  digital 
design methodology by first reviewing the basic concepts of digital multipliers
and  then  discussing  the  concept  and  theory  of  pipelining. 

The remaining chapters are organized as follows:

Chapter  2:  Introduces  the  reader  to  the  GENESIL  Silicon  Compiler. 

Chapter  3:  Presents  three  multiplier  formats:  serial,  serial/parallel,  and 
parallel. 

Chapter  4:  Presents  the  basic  concepts  of  pipelining. 

Chapter  5:  Discusses  the  design  process  of  a  pipelined  multiplier  array. 

Chapter  6:  Discusses  the  limitations  of  the  silicon  compiler. 

Chapter  7:  Concludes  the  thesis  with  a  summary  and  recommendations  for 
follow-on multiplier design.
II. GENESIL SILICON COMPILER


A.  INTRODUCTION 

The  purpose  of  this  chapter  is  to  introduce  the  reader  to  the  GENESIL 
Silicon  Compiler  (GSC)  system.  The  intent  is  to  present  a  broad  overview  of 
GSC  capabilities  so  that  the  reader  may  become  acquainted  with  the  features  used 
in  this  report.  For  a  detailed  description  of  the  GSC  system  the  reader  is  referred 
to  References  9  through  11. 

B.  GENESIL  SYSTEM  DESCRIPTION 

The  GSC  system  is  a  design  automation  software  system  which  allows 
systems  engineers  and  circuit  designers  to  design  complex  VLSI  computer  chips. 
GENESIL  produces  IC  designs  from  architectural  descriptions  and  allows  for 
their  verification.  Figure  1  shows  a  block  diagram  of  the  GSC  development 
system  and  Figure  2  depicts  the  overall  layout  of  the  GSC  system  hardware.  The 
GSC  design  tasks  and  activities  are  listed  in  Figure  3  and  it  is  these  activities  that 
will  be  emphasized  in  this  chapter. 

The  GSC  is  based  on  an  object-oriented  hierarchical  system  running  under 
the  UNIX  operating  system.  The  objects  consist  of  Blocks,  Modules,  Chips,  and 
Chip-sets.

Use  of  the  GSC  system  does  not  require  design  considerations  at  the 
transistor  gate  level.  A  systems  engineer  or  circuit  designer  can  simply 
incorporate  into  his  layout  one  of  the  myriad  of  GSC  circuits  resident  in  the  GSC 
library.  The  resident  circuits  in  the  GSC  library  consist  of  random  access 
memory  (RAM),  read  only  memory  (ROM),  programmable  logic  arrays  (PLA), 
arithmetic logic units (ALU), multipliers, and several less complex circuits such
as  basic  logic  gates  and  data-path  elements  [Ref.  12]. 

Figure 3 GENESIL Design Activities [From Ref. 12]


Before  leaving  this  section  the  reader  should  become  acquainted  with  the 
following  tasks  and  activities  of  the  GSC  development  system  in  order  to  derive 
the  maximum  benefit  from  the  design  process  described  in  Chapter  5.  For  a 
detailed  explanation  of  each  task  or  activity  the  reader  is  referred  to  [Refs.  9-11]. 

C.  TASKS  AND  ACTIVITIES 
1.  DEFINITION 

The  DEFINITION  activity  is  the  process  whereby  the  user  defines  an 
object  using  the  options  provided  in  the  DEFINITION  menu.  Defining  an  object 
consists  of  accessing  the  HEADER  and  SPECIFICATION  forms  from  the 
DEFINITION  menu. 

A.  HEADER 

Use  of  the  HEADER  option  allows  the  user  to  display  the  HEADER 
form,  which  is  dependent  on  the  current  object  connected  to  the  user's  account. 
The  HEADER  form  allows  the  user  to  specify  the  technology  and  fabrication 
lines (fablines) to be utilized in the user's design. The selected choice propagates
down  the  entire  hierarchy.  The  fabline  selection  process  used  in  this  thesis  will 
be  discussed  in  Chapter  5. 

B.  SPECIFICATION 

Use  of  the  SPECIFICATION  form,  which  is  also  dependent  upon 
the  current  object  attached  to  the  user's  account,  allows  the  user  to  fill  in  detailed 
object  characteristics.  For  example,  if  one  were  using  a  FIFO  Block  in  his 
design,  he  could  specify  its  width,  depth,  output  register,  and  connectors  through 
use  of  the  SPECIFICATION  form. 

2. NETLISTING

NETLISTING  allows  the  user  to  specify  the  interconnections  between 
Blocks  and  Modules  to  form  higher  level  functional  Modules.  This  is 
accomplished  through  the  use  of  NET_NETLIST  and  OBJECT_NETLIST.  It 
should  be  noted  that  they  both  provide  the  same  information  but  from  different 
points  of  reference. 

A.  NET_NETLIST 

NET_NETLIST  is  used  to  specify  the  signal  names  to  be  connected 
into  a  network,  and  once  they  are  defined,  the  GENESIL  System  then  creates  the 
network. 

B. OBJECT_NETLIST

OBJECT_NETLIST  allows  the  user  to  specify  the  signals  on  Blocks 
or  Submodules  in  a  Module  or  Modules  in  a  Chip,  and  the  GENESIL  system  then 
creates  the  connections  between  the  specified  objects. 

The  author  found  these  two  options  to  be  the  most  important  of  the 
GSC  options  used  in  this  thesis.  A  mastery  of  these  two  options  is  paramount  to  a 
successful  and  trouble-free  design  evolution.  It  was  preferable  to  establish  the 
initial  connections  with  OBJECT_NETLIST,  and,  if  errors  arose,  they  were 
investigated  with  NET_NETLIST.  NET_NETLIST  allows  one  to  trace  signal 
names  and  their  associated  connections. 
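
The statement that the two netlist forms carry the same information from different points of reference can be sketched in Python. This is only an illustration of the idea, not GENESIL syntax; every object, pin, and signal name below is hypothetical.

```python
from collections import defaultdict

# Object-centred view: for each object, which signal is attached to each pin.
# All names here are made up for illustration.
object_netlist = {
    "and_gate": {"a": "x0", "b": "y0", "out": "pp00"},
    "full_adder": {"a": "pp00", "b": "sum_in", "cin": "carry_in",
                   "sum": "sum_out", "cout": "carry_out"},
}


def to_net_view(objects):
    """Regroup the same connections by signal name (net-centred view)."""
    nets = defaultdict(list)
    for obj, pins in objects.items():
        for pin, signal in pins.items():
            nets[signal].append((obj, pin))
    return dict(nets)


def to_object_view(nets):
    """Invert the net-centred view back into the object-centred view."""
    objects = defaultdict(dict)
    for signal, connections in nets.items():
        for obj, pin in connections:
            objects[obj][pin] = signal
    return dict(objects)


if __name__ == "__main__":
    nets = to_net_view(object_netlist)
    # Tracing one signal lists every pin it touches.
    print(nets["pp00"])
```

Looking up a single signal in the net-centred view is the kind of tracing the text describes for investigating connection errors.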

3.  FLOORPLANNING 

FLOORPLANNING  is  the  placement  of  objects  on  the  Chip,  the 
specification  of  their  FUSION  order,  and  the  connection  of  the  pins  to  the  pads 
of  the  Chip.  The  FLOORPLANNING  task  prepares  the  design  objects  for 
routing. One should be aware that the FLOORPLANNING activities have a
significant influence on the efficiency of the router. FLOORPLANNING consists
of  the  following  activities: 

A.  PLACEMENT 

PLACEMENT  specifies  an  object's  location  relative  to  other  objects 
in  a  Module  or  Chip.  This  is  usually  done  graphically  by  either  selecting  the  GSC 
AUTO-PLACEMENT  option  or  by  manual  PLACEMENT  by  the  user.  In 
almost all cases the author preferred manual PLACEMENT over AUTO-PLACEMENT.
The PLACEMENT activity is discussed further in Chapter 5.

B.  FUSION 

The  FUSION  activity  allows  the  user  to  graphically  create  and 
modify  the  assignments  of  routing  channels  on  the  floorplan  to  influence  wire 
routing.  This  option  was  not  frequently  used  in  this  study  although  some 
experimentation  was  conducted.  There  was  no  real  enhancement  observed  to  the 
designs  in  this  thesis  when  employing  this  option.  Because  the  compiling  process 
and  the  plotting  of  the  layout  designs  were  very  time-consuming  (on  the  order  of 
several hours for large layouts), it was difficult to justify the investment of
time  for  what  little  effect  (if  any)  was  observed. 

C.  PINOUT 

PINOUT  assigns  external  signals,  both  on  and  off  the  Chip.  The  user 
must  be  aware  of  the  assignment  of  pins  as  it  affects  the  routing  both  on  and  off 
the  Chip. 

4.  COMPILE 

The COMPILE activity can be initiated by the user or by the GENESIL
system.  GENESIL  automatically  performs  a  currency  check  on  all  objects,  and  if 
any  are  determined  to  be  out  of  date  it  does  a  compile  before  any  of  the  activities 
requiring compilation. A design must first be compiled before any significant
activity  can  be  started.  Here,  the  author  found  it  to  be  a  time-saving  investment  if 
modular  subcomponents  were  first  compiled  prior  to  building  larger  arrays 
incorporating  these  same  subcomponents. 

5.  FUNCTIONAL  SIMULATION 
A. SIMULATE

SIMULATE  is  the  operation  to  simulate  the  logical  functioning  of 
the  IC  design  under  consideration.  One  may  test  the  IC  design  using  automatic 
test  vectors  or  by  initiating  manual  simulation  by  binding  the  input  pins  to  a  "0" 
or  "1"  and  manually  advancing  the  time.  Note  that  this  process  does  not  check  the 
timing  of  the  circuit.  The  manual  method  was  used  to  test  and  simulate  the 
designs  reported  on  in  this  thesis.  For  large  numbers,  the  product  was  verified 
with  an  HP-28S  hand-held  calculator.  This  topic  is  elaborated  on  in  Chapter  5. 

6.  TIMING  ANALYSIS 

The  GENESIL  Timing  Analyzer  can  calculate  and  report  on  the 
following  areas: 

•  Speed  at  which  the  object  under  analysis  will  run. 

•  Paths  that  limit  the  clock  frequency. 

•  Duty-cycle  (phase  high  time)  constraints. 

•  Input  setup  and  hold  times. 

•  Output  delays. 

•  Setup  and  hold  times  and  signal  delays  for  any  internal  nodes. 

•  Path  delays  between  internal  nodes. 

III. MULTIPLIER BASICS


A.  BASIC  MULTIPLIER  DESIGN 

This  section  provides  a  brief  review  of  basic  multiplier  design  as  background 
before  discussing  the  parallel  multiplier  arrays  implemented  in  this  report.  The 
formats  that  will  be  discussed  are  the  serial  form,  serial/parallel  form,  and  the 
parallel  form;  the  Wallace  tree  multiplier  will  also  be  briefly  discussed.  One 
should  keep  in  mind  that  the  selection  of  a  specific  multiplier  to  be  incorporated 
in  a  particular  design  is  based  on  speed,  throughput,  numerical  accuracy,  and 
area  [Ref.  2]. 

Before discussing the forms mentioned above, the most basic form of
multiplication is reviewed. This is shown in Figure 4, which illustrates the
multiplication of two positive binary integers, $14_{10}$ and $7_{10}$.


multiplicand:     1110   : 14 (base 10)
multiplier:       0111   :  7 (base 10)

                  1110
                 1110
                1110
               0000
               -------
               1100010   : 98 (base 10)


Figure  4  Basic  Form  of  Multiplication 

The multiplication is accomplished through successive additions and shifts.
This  multiplication  process  may  be  separated  into  the  following  two  steps: 

•  Evaluation  of  partial  products. 

•  Addition  of  the  shifted  partial  products. 

It  should  be  pointed  out  that  one-bit  binary  multiplication  is  equivalent  to  a 
logical  AND  operation.  Thus,  the  evaluation  of  partial  products  consists  of  the 
logical  ANDing  of  the  multiplicand  and  its  associated  bit  in  the  multiplier. 
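
The two steps can be sketched in a few lines of Python. This is a minimal software illustration of the shift-and-add idea, not a description of any GENESIL block; the function names and the default 4-bit width are assumptions chosen to match Figure 4.

```python
def partial_product(multiplicand: int, multiplier_bit: int, width: int) -> int:
    """AND every bit of the multiplicand with one bit of the multiplier."""
    result = 0
    for i in range(width):
        result |= (((multiplicand >> i) & 1) & multiplier_bit) << i
    return result


def shift_and_add(multiplicand: int, multiplier: int, width: int = 4) -> int:
    """Add the partial products, each shifted left by its multiplier bit position."""
    product = 0
    for j in range(width):
        bit = (multiplier >> j) & 1
        product += partial_product(multiplicand, bit, width) << j
    return product


if __name__ == "__main__":
    # Operands of Figure 4: 1110 (14) times 0111 (7).
    print(format(shift_and_add(0b1110, 0b0111), "b"))  # prints 1100010
```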

1. Serial Multiplier

The  simplest  example  of  a  serial  multiplier  is  illustrated  in  Figure  5. 
Here,  multiplication  is  accomplished  through  a  successive  addition  algorithm  and 
is  implemented  using  a  full  adder,  a  logical  AND,  a  delay  element,  and  a  serial- 
to-parallel  register.  The  numbers  X  and  Y  are  presented  serially  to  the  circuit 
and  the  partial  product  is  evaluated  for  each  bit  of  the  multiplier.  Next,  a  serial 
addition  is  performed  with  the  partial  additions  previously  stored  in  the  register. 
The  G2  gate  resets  the  partial  sum  at  the  beginning  of  the  multiplication  cycle 
[Ref. 2].

Figure 5 Basic Serial Multiplier [From Ref. 2]

2. Serial/Parallel Multiplier

The  basic  implementation  of  the  serial/parallel  multiplier  form  is 
illustrated  in  Figure  6.  Here,  multiplication  is  performed  by  successive  additions 
of  columns  of  the  shifted  partial  products.  As  left-shifting  by  one  bit  in  serial 
systems  is  accomplished  by  a  1-bit  delay  element,  the  multiplier  is  successively 
shifted  and  gates  the  appropriate  bit  of  the  multiplicand.  The  bits  of  the  delayed, 
gated  multiplicand  must  all  be  in  the  same  column  of  the  shifted  partial  product. 
They  are  added  to  form  the  product  bit  corresponding  to  the  appropriate  column 
[Ref. 2].

Figure 6 Basic Structure for Serial/Parallel Multiplier [From Ref. 2]

3. Parallel Multiplier

The  parallel  multiplier  form  is  the  one  utilized  in  the  design  of  the 
multipliers  in  this  thesis.  This  form  was  selected  primarily  because,  when 
incorporated  into  an  array,  the  unit  cells  of  the  multiplier  can  be  used  repeatedly, 
resulting  in  a  highly  modular  arrangement.  Recall  that  this  characteristic  makes 
the  parallel  array  multiplier  favorable  for  VLSI  implementation. 

In a parallel multiplier the partial products in the multiplication process
can be independently computed in parallel. For example, in the case of two
unsigned binary integers X and Y:

$$X = \sum_{i=0}^{m-1} X_i 2^i \qquad (3.1)$$

$$Y = \sum_{j=0}^{n-1} Y_j 2^j \qquad (3.2)$$

The product is found by

$$P = X \cdot Y = \sum_{i=0}^{m-1} X_i 2^i \cdot \sum_{j=0}^{n-1} Y_j 2^j = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} (X_i Y_j)\, 2^{i+j} \qquad (3.3)$$

The partial product terms $X_i Y_j$ are called summands. There are mn
summands, which are produced in parallel by a set of mn AND
gates [Ref. 2]. Figure 7 illustrates the partial products formed by the
multiplication of two 4-bit numbers.
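
As a rough check on the summand count, the sketch below enumerates the summands of an m x n multiplication and groups them by column weight i + j, which is how they line up in the partial-product array. It is an illustrative calculation only; the function names are assumptions, not part of the referenced design.

```python
from collections import Counter


def summand_columns(m: int, n: int) -> Counter:
    """Count how many summands X_i AND Y_j fall in each column of weight 2**(i+j)."""
    return Counter(i + j for i in range(m) for j in range(n))


def product_from_summands(x: int, y: int, m: int, n: int) -> int:
    """Evaluate the double sum of equation (3.3) one summand at a time."""
    return sum((((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
               for i in range(m) for j in range(n))


if __name__ == "__main__":
    columns = summand_columns(4, 4)
    print(sum(columns.values()))            # total number of summands, m*n
    print([columns[k] for k in range(7)])   # summands per column, LSB first
```

For two 4-bit operands the column heights rise from one summand at the least significant bit to four in the middle column and fall back to one, which is the diamond shape of the partial-product array.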

Figure 7 4-Bit Multiplier Partial Products [From Ref. 2]

For an n x n multiplier the required number of components would be
$n(n-2)$ full adders, $n$ half adders, and $n^2$ AND gates. The worst-case delay
associated with such a multiplier is $(2n+1)\tau_g$, where $\tau_g$ is the worst-case adder
delay. Figure 8 illustrates a typical parallel multiplier cell which forms the basis
of  the  multipliers  designed  in  this  thesis. 


Figure  8  Parallel  Multiplier  Cell  [From  Ref.  2] 


Note in Figure 8 above that the $X_i$ term is propagated vertically, while
the $Y_j$ term is propagated horizontally, and that the partial products enter at the
top left of each cell. A bit-wise AND is performed in each cell, and the SUM
($P_{i+1}$) is forwarded to the next cell at the lower right. The CARRY OUT ($C_{i+1}$)
is forwarded out the bottom of the cell. Figure 9 illustrates a parallel multiplier
array  with  the  partial  products  formed  within  each  parallel  multiplier  cell. 


Figure  9  Parallel  Multiplier  Array  [From  Ref.  2] 

As  alluded  to  earlier,  an  important  feature  of  the  parallel  multiplier 
array  is  that  the  unit  cells  of  the  multiplier  can  be  used  repeatedly,  resulting  in  a 
highly  modular  arrangement.  This  arrangement  of  parallel  multiplier  cells  can 
be  drawn  as  a  square  array  as  indicated  in  Figure  10.  Here,  one  can  clearly  see 
how the $X_i$ and $Y_j$ terms are propagated throughout the array by vertical and
horizontal feedthrough, respectively. As mentioned previously, this feature
makes  the  parallel  array  multiplier  highly  favorable  for  VLSI  implementation. 
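
The regular data flow of such an array can be mimicked in software. The sketch below is a simplified carry-save arrangement written for illustration: each cell ANDs its two operand bits and adds a sum and a carry from the row above, the sum moves one column toward the least significant bit on the next row, and the carry moves straight down. The exact wiring of the referenced figures may differ, and the final carry-propagate addition is replaced here by ordinary integer addition.

```python
def cell(x_bit: int, y_bit: int, sum_in: int, carry_in: int):
    """One array cell: AND the operand bits, then add sum_in and carry_in."""
    total = (x_bit & y_bit) + sum_in + carry_in
    return total & 1, total >> 1  # (sum_out, carry_out)


def array_multiply(x: int, y: int, n: int = 4) -> int:
    """Multiply two n-bit unsigned integers with an n x n array of cells."""
    sum_in = [0] * n      # sums arriving from the row above, per column
    carry_in = [0] * n    # carries arriving from the row above, per column
    low_bits = []         # product bits that leave the array on the right
    for j in range(n):                      # row j uses multiplier bit Y_j
        y_bit = (y >> j) & 1
        sums, carries = [0] * n, [0] * n
        for i in range(n):                  # column i uses multiplicand bit X_i
            sums[i], carries[i] = cell((x >> i) & 1, y_bit, sum_in[i], carry_in[i])
        low_bits.append(sums[0])            # weight 2**j, already final
        sum_in = sums[1:] + [0]             # sums shift one column to the right
        carry_in = carries                  # carries drop straight down
    low = sum(bit << k for k, bit in enumerate(low_bits))
    # Stand-in for the final carry-propagate adder on the upper product bits.
    high = (sum(bit << i for i, bit in enumerate(sum_in))
            + sum(bit << i for i, bit in enumerate(carry_in)))
    return low + (high << n)


if __name__ == "__main__":
    print(array_multiply(0b1110, 0b0111))  # 14 * 7
```

Because every cell is identical and only talks to its neighbours, the same `cell` function is reused for all n x n positions, which is the modularity the text refers to.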


Figure 10 Parallel Multiplier Array Drawn as a Square Array [From Ref. 2]

4.  Wallace  Tree 

A  general  discussion  of  digital  multiplier  design  would  not  be  complete 
without  some  mention  of  the  Wallace  tree.  As  stated  earlier,  a  study  by  Hallin 
and  Flynn  [Ref.  6]  demonstrated  that  the  most  efficient  multiplier  is  a  maximally 
pipelined  tree  multiplier  which  was  shown  to  be  50  percent  more  efficient  (with 
less  overall  delay)  than  an  array  multiplier. 

The  Wallace  tree  layout  (Figure  11)  is  significant  in  that  it  utilizes  a 
matrix  generation  and  reduction  scheme,  which  is  the  fastest  way  to  perform 
parallel  multiplication.  However,  it  has  some  disadvantages  when  implemented  in 
VLSI.  The  full  Wallace  tree  is  topologically  difficult  to  implement.  Large 
Wallace  trees  are  difficult  to  map  onto  planes  since  each  carry-save  adder 
communicates  with  its  own  slice,  transmits  carries  to  the  higher  order  slice,  and 
receives  carries  from  a  lower  order  slice.  This  topology  creates  both  I/O  pin 
difficulty  and  wire  routing  problems  [Ref.  13].  Because  a  parallel  array  is  highly 
modular,  it  was  selected  over  the  Wallace  tree  for  implementation  in  the  GSC. 
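
The reduction idea behind the tree can be sketched at the word level. In the sketch below each carry-save adder takes three operands and returns two whose sum is unchanged, and the rows of partial products are reduced in groups of three until only two remain. This is an assumed, simplified software model for illustration: a hardware Wallace tree is organized in bit slices as described above, and the final addition here is ordinary integer addition standing in for a carry-propagate adder.

```python
def carry_save_add(a: int, b: int, c: int):
    """3-to-2 reduction: returns (s, carry) such that a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def wallace_multiply(x: int, y: int, n: int = 4):
    """Multiply two n-bit unsigned integers; also report the reduction levels used."""
    # One shifted partial-product row per multiplier bit.
    rows = [(x << j) if (y >> j) & 1 else 0 for j in range(n)]
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3    # rows that form complete groups of three
        reduced = []
        for k in range(0, full, 3):
            reduced.extend(carry_save_add(*rows[k:k + 3]))
        reduced.extend(rows[full:])         # leftover rows pass to the next level
        rows = reduced
        levels += 1
    return sum(rows), levels


if __name__ == "__main__":
    print(wallace_multiply(0b1110, 0b0111))
```

The number of reduction levels grows slowly with operand width, which is the source of the tree's speed advantage; the irregular grouping of rows is also a hint of why its layout is less modular than the array.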

Figure 11 A Wallace Tree [From Ref. 14]

IV.  PIPELINING 


A.  INTRODUCTION 

The  purpose  of  this  chapter  is  to  introduce  the  reader  to  the  concept  and 
theory  of  pipelining.  As  indicated  in  the  title  of  this  thesis,  CMOS  technology 
was  utilized  in  the  implementation  of  the  parallel  multiplier  arrays  designed  in 
this  thesis.  It  was  previously  noted  that  CMOS  technology  can  substantially 
reduce  the  power  consumption  of  a  device,  but  results  in  a  much  slower  device 
speed.  Furthermore,  it  was  noted  that  a  parallel  multiplier  array  operates  at  a 
slower  speed  than  a  multiplier  tree  [Ref.  13].  By  incorporating  pipelining  into 
the  design,  however,  the  throughput  of  a  parallel  multiplier  array  may  be 
substantially  improved. 

B.  BASICS  OF  PIPELINING 

1.  Bandwidth  and  Latency 

When  one  reads  the  literature  on  pipelining  one  will  observe  that  the 
term  bandwidth  is  often  associated  with  pipelining.  Bandwidth  is  defined  as  the 
number  of  tasks  that  can  be  performed  per  unit  time  interval  [Ref.  13].  For  a 
system  that  operates  on  only  one  task  at  a  time,  latency  is  the  inverse  of 
bandwidth,  and  for  a  given  latency  the  bandwidth  can  be  increased  by  pipelining, 
which  allows  for  the  simultaneous  execution  of  many  tasks  [Ref.  13].  Figure  12 
illustrates the pipelining concept by showing that a system with a latency of n gate
delays can operate at a bandwidth of 1/n, 2/n, 3/n, etc. Figure 13 illustrates a
pipelined carry-save multiplier array; note the placement of the delay gates. This
increase in bandwidth may be accomplished by dividing the combinational logic
into separate stages which are in turn separated by latches [Ref. 13]. The goal of
designing  a  multiplier  using  pipelining  is  fast  operation.  If  some  function  can  be 
executed  in  X  ns,  and  the  design  can  be  separated  into  N  stages,  then  a  pipeline 
designed  to  perform  the  same  function  repeatedly  can  perform  that  function  in 
times  down  to  X/N  ns  [Ref.  14].  An  important  question  one  might  ask  regarding 
pipelining  is  what  is  the  maximum  rate  at  which  a  particular  pipeline  can 
operate.  This  is  discussed  in  the  following  section. 
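
The idealized relationship between latency, stage count, and bandwidth can be tabulated with a short calculation. The sketch below assumes the logic divides evenly into stages and ignores latch delays, so it gives an upper bound rather than a prediction for a real circuit; the function name and the 12-gate-delay example are assumptions chosen for illustration.

```python
from fractions import Fraction


def ideal_pipeline(latency: int, stages: int):
    """Return (cycle time, bandwidth) for an evenly divided, overhead-free pipeline."""
    cycle_time = Fraction(latency, stages)   # time per stage, X/N
    bandwidth = Fraction(stages, latency)    # tasks completed per unit time, N/X
    return cycle_time, bandwidth


if __name__ == "__main__":
    n = 12  # hypothetical latency in gate delays
    for stages in (1, 2, 3):
        cycle_time, bandwidth = ideal_pipeline(n, stages)
        print(stages, cycle_time, bandwidth)
```

Note that the total latency of a single task does not shrink; only the rate at which results emerge improves.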


(a) Nonpipelined system, bandwidth = 1/n.

(b) 2-stage pipelined system, bandwidth = 2/n.

(c) 3-stage pipelined system, bandwidth = 3/n.

Figure 12 Increasing Bandwidth by Pipelining [From Ref. 13]


Pipelined carry-save multiplication array. The square boxes are carry-save
adders with three latches. Each square box has three inputs: a sum and a
carry from previous carry-save adders, and the third is the partial product
$X_i Y_j$. The ten unmarked rectangles on the right are 1-bit latches to keep
correct timing.

Figure 13 Pipelined Carry-Save Multiplier Array [From Ref. 13]

2. Analysis of a Pipelined Stage

The following definitions are commonly used in the analysis of pipelined
stages:

$t_x$ = propagation time through the combinational logic (f) for this stage
of the pipeline (see Figure 14 (a) and (b)).

$t_r$ = minimum propagation time through the combinational logic (f) for
this stage of the pipeline.

$t_s$ = flip-flop setup time; the amount of time data has to be valid prior to
the clocking edge.

$t_h$ = amount of time data must be valid after the clocking edge (hold time).


(b) Pipelined stage

Figure  14  A  Pipeline  Stage 

The  above  definitions  can  be  used  to  determine  the  timing  restrictions 
for a pipelined circuit. For an edge-triggered D flip-flop, where T is the clock period:

$$\max(t_r + t_x) + t_s \le T$$

$$\min(t_r + t_x) > t_h$$
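
The two restrictions can be checked mechanically once path delays are known. The sketch below reads them as max(t_r + t_x) + t_s <= T and min(t_r + t_x) > t_h, and takes each path delay as the already-summed quantity t_r + t_x. The function name and all numeric values are hypothetical and are not taken from the GENESIL timing analyzer.

```python
def stage_constraints(path_delays_ns, t_s: float, t_h: float, period: float):
    """Check the setup and hold restrictions for one pipeline stage.

    path_delays_ns holds t_r + t_x for every path through the stage.
    """
    longest = max(path_delays_ns)
    shortest = min(path_delays_ns)
    return {
        "setup_ok": longest + t_s <= period,   # max(t_r + t_x) + t_s <= T
        "hold_ok": shortest > t_h,             # min(t_r + t_x) > t_h
        "min_period": longest + t_s,           # smallest T that meets setup
    }


if __name__ == "__main__":
    # Hypothetical delays in nanoseconds, for illustration only.
    print(stage_constraints([6.0, 9.5, 4.0], t_s=1.5, t_h=0.5, period=12.0))
```

The `min_period` entry answers the question raised earlier about the maximum rate of a pipeline: the slowest stage, plus setup time, sets the clock period for the whole pipeline.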


V.  DESIGN  PROCESS  OF  A  PIPELINED  MULTIPLIER 

A.  DESIGN  CONSIDERATIONS 

This  chapter  will  describe  the  design  process  for  the  parallel  multiplier 
arrays  implemented  in  this  thesis.  The  previous  sections  were  provided  to 
establish  a  background  for  the  design  process.  To  gain  more  insight  into  the 
discussions  which  follow,  it  is  highly  recommended  that  the  reader  work  through 
the  tutorial  section  of  [Ref.  8],  although  this  is  not  an  absolute  requirement.  The 
GSC  system  manuals  include  a  tutorial  section.  However,  this  author  believes  it 
was  written  with  the  presumption  that  the  reader  had  attended  a  one-week  course 
of  instruction  taught  by  the  Silicon  Compiler  System  Corporation  of  San  Jose, 
California. Without this course of instruction the user may have some difficulty
working  through  the  tutorial  sections  until  some  proficiency  has  first  been 
acquired. 

As  stated  earlier,  the  parallel  multiplier  array  of  Figure  8  (incorporating  the 
parallel  multiplier  cell)  was  selected  for  implementation  in  the  GSC.  This 
decision  was  based  primarily  on  the  array's  modular  architecture.  It  was  also 
apparent  that  its  feature  of  horizontal  and  vertical  feedthrough  was  advantageous 
for  implementation  in  VLSI  because  the  routing  of  the  inputs  Xi  and  Yi 
throughout  the  entire  array  would  be  simplified. 

1.  Modeling  the  Parallel  Multiplier  Cell 

One  of  the  first  design  considerations  contemplated  was  how  to  model 
the  basic  parallel  multiplier  cell  of  Figure  8.  In  Figure  8,  the  bit-wise  ANDing  of 
the partial products occurs inside the cell's boundaries. The result of each
bit-wise AND is summed with the SUM of another multiplier cell, as well as with a
CARRY IN. The author determined that this cell could be implemented in
GENESIL  by  using  a  1-bit  full  adder  with  one  input  being  provided  by  the 
output  of  an  AND  gate  (from  the  formation  of  the  partial  products)  and  the  other 
from the SUM of another adder. Note that a 1-bit full adder also provides for a
CARRY  IN  and  CARRY  OUT.  Figure  15  shows  the  basic  cell  and  its  layout  is 
illustrated  in  Figure  16. 
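
The behaviour of this cell can be captured in a small gate-level model. The sketch below is a plain Python illustration of an AND gate feeding a 1-bit full adder; it is not GENESIL input, and the function and argument names are assumptions chosen to match the signal labels used in the text.

```python
from itertools import product


def full_adder(a: int, b: int, cin: int):
    """Gate-level 1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def multiplier_cell(x_bit: int, y_bit: int, sum_in: int, carry_in: int):
    """AND the operand bits, then add the result to SUM IN and CARRY IN."""
    partial = x_bit & y_bit
    return full_adder(partial, sum_in, carry_in)


def verify_cell() -> bool:
    """Exhaustively compare the cell against integer arithmetic."""
    for x, y, s_in, c_in in product((0, 1), repeat=4):
        s_out, c_out = multiplier_cell(x, y, s_in, c_in)
        if s_out + 2 * c_out != (x & y) + s_in + c_in:
            return False
    return True


if __name__ == "__main__":
    print(verify_cell())
```

With only sixteen input combinations, an exhaustive check of this kind mirrors the manual simulation approach of binding each input pin to "0" or "1".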


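[Diagram: inputs Xi and Yj feed an AND gate; the AND output and SUM IN drive the operand inputs of a 1-bit full adder, with CARRY IN on CIN; the adder outputs OUT and COUT provide SUM OUT and CARRY OUT.]

Figure 15 Parallel Multiplier Cell for Implementation in GENESIL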

Figure 16 GENESIL Layout of a Parallel Multiplier Cell (101.6 mils²)

2. Selecting a Fabline

The next design consideration was to select a "fabline", that is, a
particular set of design rules used by a foundry to manufacture a Chip. Because
Stuart [Ref. 15] did a full custom parallel multiplier array design using 1.5 µm
CMOS, the same micron technology was selected for this study to enable a
comparison of results. Figure 17 shows the fablines available for selection.

Figure 17  Selection of a Fabline

Note that fablines which include the number 15 are 1.5 µm technology.
Speed was used as the criterion for selecting a particular 1.5 µm CMOS fabline.
To determine which fabline was the fastest, a timing analysis was
performed on four adders, each incorporating a different 1.5 µm fabline. Figure
18 illustrates a linear view of a GENESIL 1-bit full adder (note the labeling of
the signal lines), and Figure 19 illustrates the layout of a 1-bit GENESIL full
adder.


Figure 18  Linear View of a GENESIL 1-Bit Full Adder

Figure 19  GENESIL Layout of a 1-Bit Full Adder

The results of the timing analysis are listed in Table 1. The NSC_CN15A
fabline was selected because it had the smallest maximum output delay for both
the CARRY OUT (cout[0]) and the SUM OUT (sout[0]).

TABLE 1

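OUTPUT DELAYS FOR A GENESIL 1-BIT FULL ADDER

Column headings: Fabline; Phi (r) Delay (ns) for cout[0] (Min, Max) and for
sout[0] (Min, Max); three further columns whose headings are illegible.
Delay entries are listed in the order in which they appear.

Fabline      Delay entries (ns)            Remaining three columns
TSB_CP15A    2.8, 2.8, 7.2                 8.91    4.28    38.08
NCR_CN15A    8.4, 8.4                      8.91    4.28    38.08
US2_CN15A    [illegible], 8.1, 1, 7.5      10.09   4.85    48.91
NSC_CN15A    2.1, 5.1, 3.9, [illegible]    8.91    4.28    38.08

Note: 1 mil = 0.001 inches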


In addition to the 1-bit full adder, a GENESIL D flip-flop was also
tested to determine if there was a difference in the output delay for each 1.5 µm
fabline. The results are listed in Table 2. As expected, in view of the results in
Table 1, the NSC_CN15A fabline produced a shorter output delay than the
other fablines. Figure 20 illustrates a linear view of a GENESIL D flip-flop and
Figure 21 illustrates the GENESIL layout of a D flip-flop.

TABLE 2

OUTPUT DELAY FOR A GENESIL D FLIP-FLOP

             Phi (r) Delay (ns)
Fabline      Min    Max            [three column headings illegible]
TSB_CP15A    4.5    5.0            3.27    8.46    27.63
NCR_CN15A    6.0    [illegible]    2.88    7.46    21.51
US2_CN15A    4.8    5.8            2.88    7.46    21.51
NSC_CN15A    3.8    4.0            2.88    7.46    21.51

Figure 20  Linear View of a GENESIL D Flip-Flop

Figure 21  GENESIL Layout of a D Flip-Flop

The following section describes the design process and the
integration of the parallel multiplier cells into functional multiplier arrays.

B.  DESIGN OF A 4-BIT PIPELINED MULTIPLIER ARRAY
1.  Signal  Naming  Scheme 

The  author  made  a  decision  early  in  the  implementation  phase  to  first 
demonstrate  the  feasibility  and  functionality  of  the  parallel  multiplier  array  by 
constructing  a  4-bit  unsigned  multiplier.  Once  the  basic  design  was  validated,  a 
pipelined  version  and  larger  arrays  were  then  constructed. 

Using a CAD tool, a 4-bit version of Figure 9 was drafted and is shown in
Figure 22. However, before the drawing could be made it was necessary to
devise a signal naming scheme. A requirement was set that this scheme must
impart some information on the origin of a signal, to assist in troubleshooting
the circuit, as well as be applicable to all of the parallel multipliers implemented
in this thesis.

Therefore, the scheme was based on a labeling convention similar to that of a
full adder. For example, the signals SUM OUT and CARRY OUT were labeled
as product out "po" and carry out "co", respectively. These labels were further
modified to "pokj" and "cokj", where k indicates the level number and j indicates
the adder position in a particular level. Here, k ranges from 0 to n, where n is
the number of bits the multiplier is capable of operating on. The j indicates the
position of the adder from the right-hand side of the level in which it is located
and it ranges from 0 to n - 1. For example, "po23" indicates the signal "product
out" from level 2 adder 3. Additionally, all AND gates were labeled according to
the partial products they form. For example, X2Y0 indicates the ANDing of
X2 and Y0. Furthermore, each row of adders was labeled as
"level_k" and each adder was labeled as "ADDkj", where k and j correspond to
the level number and the adder's position, respectively. Finally, the last row of
adders was labeled as "FAPx" where x indicates a particular final product. For
example, "FAP4" indicates the final adder whose output is product 4.

[Diagram: rows of parallel multiplier cells labeled LEVEL_0 through LEVEL_4.]

Figure 22  CAD Layout of a 4-Bit Parallel Multiplier Array
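The naming scheme can be summarized as a small set of label generators. This is a sketch in Python for illustration only; the function names are assumptions introduced here, and only the label formats themselves come from the scheme described above.

```python
def cell_labels(k: int, j: int) -> dict:
    """Labels for the adder at level k, position j (counted from the right)."""
    return {
        "adder": f"ADD{k}{j}",        # adder instance name
        "product_out": f"po{k}{j}",   # SUM OUT of that adder
        "carry_out": f"co{k}{j}",     # CARRY OUT of that adder
    }


def and_gate_label(i: int, j: int) -> str:
    """Label for the AND gate that combines Xi and Yj."""
    return f"X{i}Y{j}"


def final_adder_label(x: int) -> str:
    """Label for the final adder whose output is product bit x."""
    return f"FAP{x}"
```

For example, `cell_labels(2, 3)["product_out"]` gives "po23". Because the indices are simply concatenated, the labels are unambiguous only while k and j are single digits, which holds for the 4-bit and 8-bit arrays.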

2.  4-Bit  Multiplier  Array 

From the very start of the construction phase for the 4-bit multiplier
array, there were questions regarding what method(s) and what Blocks or
Modules should be employed to build the arrays. The first approach at
constructing the array was to create a random logic Block (labeled multi_4bit).
After selecting the fabline NSC_CN15A for this Block, 19 full adders, 16 AND
gates, and one OR gate were attached to it through the use of the options
SPECIFICATION and NEW. These components were then connected as in
Figure 22 by indicating the appropriate signal names in the SPECIFICATION
form. The SIGNALS function was then used to designate whether a particular
signal was an "input, output or bi-level." This first attempt resulted in a long
"stick-like" structure (see Figure 23) which would not be suitable for a Chip
layout simply due to its inefficient use of space. If larger multipliers were
constructed using this method, one would produce long arrays whose length
would be proportional to the number of bits to be multiplied. Therefore, other
methods were sought to reduce the length of the array.
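The component totals quoted above (19 full adders, 16 AND gates, one OR gate) are consistent with an n x n array of multiplier cells plus a final row of n - 1 adders and a single OR gate. The sketch below assumes that this structure carries over to other word lengths; the function name `component_counts` is introduced here for illustration.

```python
def component_counts(n: int) -> dict:
    """Component totals for an n-bit array, assuming n*n cells plus n-1 final adders."""
    cells = n * n            # one AND gate and one full adder per multiplier cell
    final_adders = n - 1     # last row of adders forming the upper product bits
    return {
        "full_adders": cells + final_adders,
        "and_gates": cells,
        "or_gates": 1,       # single OR gate in the final level
    }


print(component_counts(4))  # {'full_adders': 19, 'and_gates': 16, 'or_gates': 1}
```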

One method considered was to simply divide the array into rows of
adders (similar to Figure 10) according to their level by putting each row of
adders in a random logic Block. Each random logic Block would then be attached
to a general random logic Module (labeled 4bmm, for 4-bit multiplier module)
and the rows of adders would be interconnected again as in Figure 22. When
implemented, this method proved successful in reducing the previous "stick-like"
structure to a more compact modular arrangement. Figure 24 is the GENESIL
layout of this new modular arrangement.

Figure 23  GENESIL Layout of multi_4bit

Figure 24  GENESIL Layout of 4bmm (1,958.3 mils²)

The construction of the rows of adders (levels) in the modular
arrangement was accomplished through the employment of a generic "level_k".
As stated previously, a random logic Block was defined and four adders and four
AND gates were attached to it. The Block was then labeled as level_k. Through the
use of "ATTACH EXISTING", while the Module 4bmm was at the top of the
hierarchy, the generic level_k was successively attached. Each time level_k was
attached to the Module it was renamed according to its assigned level in Figure
22. The last row of adders was constructed by simply deleting the AND gates and
a 1-bit full adder from the generic level_k, and attaching an OR gate. The generic
level_k is illustrated in Figure 25. Figure 26 is a GENESIL linear view of the
generic level_k. A CAD drawing of the general random logic Module 4bmm
illustrating its block level layout is shown in Figure 27.


[Diagram: generic random logic Block called "level_k", composed of four ADDER/AND combinations (ADDkj). Each ADDER/AND combination is a parallel multiplier cell: inputs X and Y feed an AND gate, and a 1-bit full adder (inputs A, B, CIN; outputs OUT, COUT) produces SUM OUT and CARRY OUT.
k = level (increasing from top to bottom), from 0 to n, where n = number of bits the multiplier is capable of operating on.
j = adder position (increasing from right to left), from 0 to n - 1.
Example: ADD02 is level_0, adder number 2.]

Figure 25  CAD Depiction of Generic Level_k

Figure 26  GENESIL Linear View of Generic Level_k

[Diagram for Figure 27: general Module (random logic) called "4bmm".]

A.  Version 1


After a close inspection of Figure 24 (from this point on this layout
will be referred to as 4bmm.1 to indicate version 1 of 4bmm) the author decided
that the modular arrangement of Figure 27 was probably the best one to use
when implementing parallel multiplier arrays in GENESIL. This decision was
based primarily on the modular arrangement of the parallel multiplier cells, as
well as the overall symmetry of the layout.

Before attempting to improve on the initial layout of Figure 24, the
functionality of the multiplier array was verified. This was a simple task and was
accomplished as described on page 102 of Reference 7. Several different binary
numbers were multiplied and their resulting products were verified using a
hand-held HP-28S calculator. The following is an example of how multiplication
was performed by GSC. The assignment of binary values to the inputs of
4bmm.1, x[3:0] and y[3:0], and the product of multiplication are illustrated in
Figures 28 and 29, respectively.

Figure 28  Assignment of Binary Values to Inputs of 4bmm.1

Figure 29  Product of Multiplying 1001 x 1001 Using 4bmm.1
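The same kind of functional check can be made on a bit-level software model before any layout work. The sketch below is a minimal Python model, assuming a conventional carry-save arrangement: n levels of AND/full-adder cells, followed by a row of n - 1 rippled final adders and one OR gate for the top product bit. The exact wiring of Figure 22 is not reproduced here, and the names `cell` and `array_multiply` are introduced for illustration.

```python
def cell(x: int, y: int, sum_in: int, carry_in: int):
    """One multiplier cell: AND gate plus 1-bit full adder. Returns (sum, carry)."""
    total = (x & y) + sum_in + carry_in
    return total & 1, total >> 1


def array_multiply(x: int, y: int, n: int = 4) -> int:
    """Multiply two unsigned n-bit numbers with an array of multiplier cells."""
    xb = [(x >> j) & 1 for j in range(n)]
    yb = [(y >> k) & 1 for k in range(n)]
    sums = [0] * (n + 1)   # SUM OUT of the previous level; sums[n] is always 0
    carries = [0] * n      # CARRY OUT of the previous level
    product_bits = []
    for k in range(n):     # one level of cells per multiplier bit
        new_sums = [0] * (n + 1)
        new_carries = [0] * n
        for j in range(n):
            # Cell j adds its partial product to the sum from the cell one
            # position to the left and the carry from the cell directly above.
            new_sums[j], new_carries[j] = cell(xb[j], yb[k], sums[j + 1], carries[j])
        sums, carries = new_sums, new_carries
        product_bits.append(sums[0])   # rightmost sum is product bit k
    ripple = 0
    for j in range(n - 1):             # final adders, carry rippled to the left
        total = sums[j + 1] + carries[j] + ripple
        product_bits.append(total & 1)
        ripple = total >> 1
    product_bits.append(carries[n - 1] | ripple)   # OR gate forms the top bit
    return sum(bit << i for i, bit in enumerate(product_bits))


print(array_multiply(0b1001, 0b1001))  # 81, i.e. 1001 x 1001 = 01010001
```

Because a 4-bit multiplier has only 256 input combinations, the model can be checked exhaustively against ordinary integer multiplication rather than by spot checks.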

Following the verification of the functionality of 4bmm.1, a timing
analysis was performed to determine the output delays for each product P[7:0].
This was accomplished by selecting TIMING from the Executive menu and
executing OUTPUT_DELAY. The results are listed in Figure 30 and indicate
4bmm.1 can theoretically be operated at approximately 29 MHz (1/34.7 ns). This
calculation is based on the output delay of P7 since it is the limiting product; it
has the largest maximum delay of all the products.

Once 4bmm.1 was verified to be operating correctly, attempts were
made to improve the speed and reduce the size of the array, by experimenting
with changing the order and the location of the adder levels and by replacing
"FAP4-6" with a GENESIL library 3-bit adder.

B.  Version  2 

Version two of the array was created by replacing the final adders
of level_4 (FAP4-6) with a GENESIL library 3-bit adder (see Figure 31). As in
version one, a functional verification was conducted before performing a
timing analysis. The results of the timing analysis are listed in Figure 32 and the
layout of 4bmm.2 is shown in Figure 33. One can see from the results in Figure
32 that the use of the GENESIL library 3-bit adder in level_4 resulted in a slight
reduction in the output delay for P7. The operating speed was calculated to be
approximately 30 MHz, and there was no significant change in size. However,
comparing the layout of level_4 of versions 1 and 2 shows that the GENESIL 3-bit
adder of version 2 is of higher density than the 3 individual 1-bit adders of
version 1.

Figure 30  Timing Analysis of 4bmm.1

C.  Version 3

Version 3 (4bmm.3) was the first attempt at reordering the adder
levels to determine what effect this would have on the size and speed of the
array. When developing versions 1 and 2, the ordering of the levels was
determined by the AUTO_PLACEMENT option from the PLACEMENT menu,
which is a submenu of FLOORPLANNING. Although the specifications of the
array were entered into the GSC as in Figure 22, this did not necessarily
guarantee that the levels would be oriented in the same manner. When
performing FLOORPLANNING the user can elect to use either
AUTO_PLACEMENT or manual PLACEMENT to arrange the relative
positions of the levels. For versions 1 and 2 AUTO_PLACEMENT was selected.
It uses an algorithm built into the GSC to determine the best placement of the
individual levels. Figure 34 illustrates the AUTO_PLACEMENT of the adder
levels as determined by the GSC. Note that the order is arranged according to the
specifications of Figure 22, with the exception that the final adders (level_4) are
located to the right of level_0.

Figure 34  AUTO_PLACEMENT of Adder Levels (V1&2)

In version 3 (4bmm.3) the order was rearranged from top to
bottom, using manual PLACEMENT, according to the "logic flow". This
reordering is illustrated in Figure 35. Note that the final 3-bit adder (level_4) is
now located below level_3. A GENESIL layout of this arrangement is shown in
Figure 36.


Figure  35  Reordering  of  Adder  Levels  According  to  Logic  Flow 


Figure 36  GENESIL Layout of 4bmm.3 (1,845.63 mils²)

From the results of a timing analysis performed on 4bmm.3 it was
determined that the reordering had no significant effect on the output delay of
P7. The output delay for P7 of 4bmm.2 was 32.5 ns and for 4bmm.3 it was 32.4
ns. However, there was a 6% reduction in the overall size of the array. The
4bmm.2 design had a total area of 1964.02 mils² while that of 4bmm.3 was
calculated to be 1845.63 mils². Close inspection of Figure 36 reveals that there is
almost an equal distribution of metal above the final adders of level_4. One can
see metal stretching from the lower right side of level_3 across to the adders of
level_4. Level_4 was centered directly below level_3 to see if the metal routing
could be more equally distributed and perhaps further reduce the total area. This
was accomplished in version 4 below.

D.  Version  4 

As stated above, version four (4bmm.4) was simply a centering of
level_4 directly below level_3. The layout of 4bmm.4 is shown in Figure 37.
Again, there was no further reduction in the output delay of P7; however, there
was a very slight reduction in the size of the array. The total area of 4bmm.4
was calculated to be 1835.9 mils², which is a reduction of less than 1% from the
total area of 4bmm.3. Also, note that the metal routing between levels 3 and 4
has been thinned out.

Figure 37  GENESIL Layout of 4bmm.4 (1,835.9 mils²)

3.  4-Bit  Multiplier  Array  with  Registered  Inputs/Outputs 
A.  Version 1

When multipliers are implemented in actual circuits they are often
constructed with registered inputs and outputs. This is essential for pipelined
multipliers. Therefore, a bank of 8 D flip-flops was added to the inputs, x[3:0]
and y[3:0], and to the products P[7:0] as illustrated in Figure 38 (labeled
4bmm1.RIRO). Here, AUTO_PLACEMENT was used to see what the GSC
system would determine to be the best placement of the adder levels and the two
banks of D flip-flops. The resulting floorplan is shown in Figure 39. Note
how the AUTO_PLACEMENT algorithm placed the input registers next to the
level_3 adders. One can see similarities between this floorplan and that of 4bmm.1
and 4bmm.2 in Figure 34. It appears the AUTO_PLACEMENT algorithm
favors the placement of level_4 next to level_0. Figure 40 is a GENESIL layout
of 4bmm1.RIRO.

Figure 39  AUTO_PLACEMENT of 4bmm1.RIRO

Figure 40  GENESIL Layout of 4bmm1.RIRO (2,551.69 mils²)

B.  Version  2 

Version 2 (4bmm2.RIRO) is 4bmm.4 with registered inputs and
outputs. It was implemented in the same fashion as 4bmm1.RIRO; however,
manual PLACEMENT was used instead of AUTO_PLACEMENT. The input and
output registers were manually placed as drawn in Figure 38, and the resulting
floorplan is illustrated in Figure 41. Here, one can see an overlap between
adjacent levels. This was done manually to determine what effect overlap would
have on the GSC. The resulting layout of 4bmm2.RIRO is shown in Figure 42.
The total area of 4bmm2.RIRO was 2459.07 mils² while 4bmm1.RIRO totaled
2551.69 mils². The 4bmm2.RIRO design resulted in approximately a 3.6%
reduction in area compared to 4bmm1.RIRO, and had a much "cleaner" looking
layout.

Figure 42  GENESIL Layout of 4bmm2.RIRO (2,459.07 mils²)
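The area comparisons quoted for the layout versions are relative reductions with respect to the earlier layout. A minimal Python sketch of that calculation follows, using the two RIRO areas given above; the function name `percent_reduction` is an assumption introduced here.

```python
def percent_reduction(before: float, after: float) -> float:
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (before - after) / before


# Areas in square mils for 4bmm1.RIRO and 4bmm2.RIRO, taken from the text.
print(round(percent_reduction(2551.69, 2459.07), 1))  # 3.6
```

A negative result indicates that the second layout is larger than the first.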

4.  4-Bit  Pipelined  Multiplier  Array 

After  experimenting  with  the  4-bit  multiplier  array,  the  author 
concluded  that  the  best  arrangement  for  the  registers  and  adder  levels  was  as 
indicated  in  Figure  42.  As  demonstrated  by  the  timing  analysis  for  4bmm.2  and 
4bmm.3,  there  was  no  significant  reduction  in  the  output  delay  of  P7  when  the 
adder  levels  were  oriented  in  the  order  of  "logic  flow".  However,  it  was 
demonstrated  that  orienting  the  adder  levels  in  the  order  of  the  "logic  flow" 
resulted  in  an  overall  reduction  in  array  area.  With  this  in  mind,  it  was  decided 
to  orient  the  pipelined  version  of  the  4-bit  multiplier  array  in  the  same  manner; 
that  is,  in  the  order  of  the  "logic  flow." 

Before designing the 4-bit pipelined version it was necessary to
determine between which levels to insert a bank of D flip-flops. From inspection
of Figure 32, it was decided to insert a row of flip-flops between level_2 and
level_3 (see Figure 43). This would provide for two pipelined stages without
splitting up the library 3-bit adder into individual adder units as was previously
done. The first stage requires approximately 17.6 ns to propagate the partial
multiplication products while the second stage requires approximately 14.9 ns
(32.5 ns - 17.6 ns). Here, one can see the limiting stage is comprised of level_0
through level_2. In other words, the multiplier is limited by the pipelined stage with
the longest delay. However, one must also include the delay of the D flip-flops in
the overall timing calculation. The theoretical clock period (T) is determined
from the sum of the longest pipelined stage delay, the flip-flop delay, and the
setup time for the flip-flops. Here, the assumption is made that all stages in the
pipeline receive the same clock pulse simultaneously. In reality, due to circuit
lengths, loading, and driver circuits it is nearly impossible to guarantee that all
stages of a pipelined circuit receive the same clock pulse at exactly the same time.
From Table 2, and Figures 32 and 44, T is estimated at 23.1 ns [17.6 ns (slowest
stage delay) + 4.0 ns (D flip-flop delay) + 1.5 ns (setup time)].
The corresponding clock frequency was estimated at approximately 43 MHz
(1/T). The theoretical clock frequency for 4bmm2.RIRO was determined to be
approximately 26 MHz (1/38 ns) [32.5 ns (delay for entire array) + 4.0 ns (D
flip-flop delay) + 1.5 ns (setup time)]. 4bmmPL illustrates the increase in
throughput when pipelining is employed. The GENESIL floorplan and layout for
4bmmPL are shown in Figures 45 and 46.

Figure 43  CAD Drawing of a 4-Bit Pipelined Multiplier Array (4bmmPL)

Figure 44  Input Setup and Hold Times for 4bmmPL
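The clock estimate above can be reproduced with a short calculation. The following Python sketch uses the delays quoted in the text and, like the text, assumes that every stage receives the clock edge simultaneously; the function names are introduced here for illustration.

```python
def clock_period_ns(stage_delays_ns, flip_flop_delay_ns, setup_time_ns):
    """Clock period: slowest stage delay plus flip-flop delay plus setup time."""
    return max(stage_delays_ns) + flip_flop_delay_ns + setup_time_ns


def frequency_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in megahertz."""
    return 1000.0 / period_ns


# Two pipelined stages (17.6 ns and 14.9 ns) versus the whole array (32.5 ns).
pipelined = clock_period_ns([17.6, 14.9], 4.0, 1.5)
unpipelined = clock_period_ns([32.5], 4.0, 1.5)

print(f"{pipelined:.1f} ns, {frequency_mhz(pipelined):.1f} MHz")      # 23.1 ns, 43.3 MHz
print(f"{unpipelined:.1f} ns, {frequency_mhz(unpipelined):.1f} MHz")  # 38.0 ns, 26.3 MHz
```

The calculation makes the trade-off visible: only the slowest stage sets the clock period, so a more even split of the array delay between stages would raise the achievable clock frequency further.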


Figure 45  Floorplan for 4bmmPL

Figure 46  GENESIL Layout of 4bmmPL (4,455.45 mils²)

Following the construction and functional verification of 4bmmPL, a
timing analysis was performed to determine the accuracy of the predicted clock
speed vs. the actual clock speed as determined by GENESIL. The option "clocks"
was used to determine the worst case paths. From inspection of Figure 47, one
can see that the worst case path was determined to be 24.6 ns, or approximately
40 MHz. This indicates the predicted value was in error by approximately 7%. It
is assumed that when the circuit is tested as a whole, greater accuracy is
achievable due to simulation of the loading conditions, as well as circuit length
delays.

Figure 47  Clock Worst Case Paths for 4bmmPL

After a timing analysis was conducted, the orientation of the levels and
registers was varied to determine if a smaller layout could be attained.

The first attempt at decreasing the layout of 4bmmPL was to use
GENESIL's AUTO_PLACEMENT algorithm instead of manual PLACEMENT
during the FLOORPLANNING process. The resulting floorplan is shown in
Figure 48. It reveals a totally different perspective on arranging the Blocks
which comprise 4bmmPL. One can see how the algorithm placed the pipeline
register (PL_1) next to the input and output registers, DFF_IN and DFF_OUT
respectively. The resulting GENESIL layout is shown in Figure 49. GENESIL's
AUTO_PLACEMENT algorithm was able to reduce the layout area by approximately
22% (from 4,455.45 mils² to 3,476.5 mils²) by simply rearranging the Blocks
during FLOORPLANNING.

Figure 48  Floorplan from AUTO_PLACEMENT of 4bmmPL

Figure 49  GENESIL Layout of 4bmmPL After AUTO_PLACEMENT (3,476.5 mils²)

After observing the results of GENESIL's AUTO_PLACEMENT
algorithm, the author decided to "challenge" GENESIL's algorithm by splitting
PL_1 of Figure 43 in an attempt to further reduce the total area of 4bmmPL. The
splitting was accomplished by using two banks of D flip-flops. One bank
contained 8 flip-flops and the other 7. The two banks, labeled PL_1A and
PL_1B, were manually placed at the sides of levels 1, 2, and 3 as illustrated in
Figure 50. The resulting GENESIL layout is shown in Figure 51. Here, one can
also see the difference between what is shown in the floorplan view and the final
GENESIL layout. This orientation did not result in a smaller total area than that
achieved by GENESIL's AUTO_PLACEMENT algorithm (3477.5 mils² for
AUTO_PLACEMENT versus 3850.7 mils² for the split arrangement).

Figure 50  Floorplan of Split PL_1A and PL_1B of 4bmmPL

Figure 51  GENESIL Layout of Split PL_1A and PL_1B of 4bmmPL (3,850.72 mils²)

A final attempt at reducing the area was accomplished by stacking
PL_1A on top of PL_1B, and then positioning them between levels 2 and 3.
AUTO_FUSION was then selected. The resulting layout is shown in Figure 52.

Figure 52  Stacking of PL_1A and PL_1B of Split 4bmmPL

A rather surprising result was observed. It appears that the
AUTO_FUSION option "pushed" the two stacked registers below the final adders
even though they were manually placed between levels 2 and 3. This orientation
was not as successful in reducing the total area as was AUTO_PLACEMENT.
Therefore, one must conclude that GENESIL's AUTO_PLACEMENT algorithm
is better able to place the individual Blocks of 4bmmPL to achieve a smaller total
area. Even though it was demonstrated that the orientation in Figure 49 resulted
in the smallest total area, it was decided to incorporate the orientation of Figure
46 into a Chip Module to better illustrate the concept of pipelining. Figure 53
shows the floorplan for the 4-bit multiplier Chip (4bmulti_chip) and its
GENESIL layout is shown in Figure 54.

Note that the Chip Module 4bmulti_chip is approximately 445% greater in total
area than 4bmmPL.

C.  DESIGN  OF  AN  8-BIT  PIPELINED  MULTIPLIER  ARRAY 
1.  8-Bit  Multiplier  Array 

After  the  design  of  the  4-bit  pipelined  multiplier  array  was  completed, 
efforts  were  directed  towards  developing  the  layout  of  an  8-bit  pipelined 
multiplier.  The  same  basic  techniques  used  in  the  development  of  the  4-bit 
multiplier  were  applied. 

A.  Version  1 

The first step was to extend the CAD drawing of Figure 22 to an 8-bit
array. Figures 55 and 56 show the CAD drawing for an 8-bit parallel
multiplier array (version 1 was labeled 8bmm.1). Note the final row of adders.
Each final adder (FAP8-FAP14) is a 1-bit full adder. The carry out of each adder
is rippled to the adjacent adder to the left. A generic level_k, comprised of 8 full
adders and 8 AND gates, was employed to construct the array.

The AUTO_PLACEMENT algorithm was used during FLOORPLANNING
in order to evaluate its placement of the blocks for the array.
Figure 57 shows the results of GENESIL's AUTO_PLACEMENT algorithm for
8bmm.1. One can see a similarity to Figure 24. Note how the
AUTO_PLACEMENT algorithm in both cases positioned the smallest block at
the top of the array. Also, note in Figure 57 that the levels are not arranged in
the order of "logic flow." Figure 58 shows the GENESIL layout for 8bmm.1
with a total area of 8157.5 mils². One can see a thickening of metal between
level_2 and the other adder levels, as well as to the left of the array in both the
upper and lower regions.

Figure 55  CAD Layout (Upper Half) for 8bmm.1

Figure 57  Floorplan for 8bmm.1

Figure 58  GENESIL Layout for 8bmm.1 (8,157.51 mils²)

Before further modifications to the array were made, the
functionality was verified. Following the functional verification, a timing
analysis was conducted and the results are shown in Figure 59.

Figure 59  Timing Analysis for 8bmm.1

B.  Version 2

Following the functional verification and timing analysis for
8bmm.1, the orientation of the ADDER/AND levels of the multiplier was
changed to reflect the order of logic flow. The floorplan for this orientation
(labeled 8bmm.2) is shown in Figure 60. Note the spacing between the levels of
the floorplan. This was done for comparison with the next iteration to determine
what effect spacing and overlap would have on the overall multiplier size. Figure
61 shows the resulting GENESIL layout. Comparing Figures 58 and 61, one can
see the latter is a "cleaner" looking layout with minimal metal running
throughout the array. The resulting area was calculated to be approximately
8474.23 mils² compared to 8157.51 mils² for 8bmm.1. This represents
approximately a 4% increase in area. A timing analysis was also conducted to
determine if this orientation resulted in a lower propagation delay for P15. The
results of the timing analysis indicate that there was no significant difference in
the propagation delay for P15 (52.3 ns vs 53.5 ns for 8bmm.2 and 8bmm.1,
respectively).

C.  Version  3 

The next iteration (8bmm.3) was done specifically to determine if
the multiplier area could be reduced if adjacent levels were slightly overlapped
during FLOORPLANNING. Figure 62 shows how the individual levels were
manually placed and overlapped during the FLOORPLANNING process. The
resulting layout for 8bmm.3 was similar to Figure 61.

Figure 62  Floorplan for 8bmm.3

The resulting area was calculated to be 8513.23 mils². This
represents an increase of less than 1% over 8bmm.2. This suggests that
overlapped levels will be separated by a slightly greater amount than if they were
adjoining each other.

D.  Version 4

The next iteration (8bmm.4) was a modification to 8bmm.3 by
replacing the final individual 1-bit adders with a 7-bit adder. As observed in
4bmm.2, it was expected that the propagation delay of the final product (here
P15) would be reduced. Figure 63 shows this modification to level_8. The
floorplan for 8bmm.4 was identical to 8bmm.3 (see Figure 62).
The resulting layout is shown in Figure 64. Close inspection of level_8 reveals a
higher density for the 7-bit adder than for the individual adders of 8bmm.3. A
timing analysis was performed on 8bmm.4 and the results are shown in Figure
65. As expected, the delay for P15 of 8bmm.4 was reduced by 6.5 ns (67.6 ns -
61.1 ns), which represents a reduction of approximately 6% in propagation
delay.

Figure 64  GENESIL Layout for 8bmm.4 (8,539.21 mils²)

Figure 65  Timing Analysis for 8bmm.4

E.  Version  5 

The last iteration of this particular orientation centered the final row
of adders directly below the last level of the array, as in 4bmm.4. The layout
(8bmm.5), shown in Figure 66, reduced the total area by approximately
2% relative to 8bmm.4. There was no change in the timing
analysis; it was the same as for 8bmm.4 (Figure 65).

Figure 66 GENESIL Layout of 8bmm.5 (8,395.65 mils²)

F.  Version  6 

The last version of the 8-bit multiplier array (8BITMOD) was
constructed from four 4-bit multiplier array modules (see Figure 22). The
floorplan for 8BITMOD is shown in Figure 67. Each 4-bit multiplier array
module was attached to a common general module, as well as a single random
logic Block containing the final adders. Although this particular orientation did
not result in a reduction in total area, the design was very useful in learning how
to use OBJECT_NETLIST and NET_NETLIST. 8BITMOD required extensive
use of OBJECT_NETLIST when interconnecting the four individual modules,
particularly when routing signals across the module boundaries. For example, a
signal can be identified inside a module as signal "x", but when the signal line
leaves the module and is routed to another module, one can change its name to
signal "y". This property was very useful and minimized the requirement to
"customize" each individual 4-bit multiplier. The GENESIL layout for
8BITMOD is shown in Figure 68. The total area is approximately 8993.1 mils².
This was the largest of the 8-bit parallel multiplier arrays.

Figure 67 Floorplan for 8BITMOD (blocks: 4-bitblk1, 4-bitblk2, 4-bitblk3,
4-bitblk4, ADDER)
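
The renaming of a signal at a module boundary, as described above, can be
illustrated with a small model. The sketch below is plain Python and is not
GENESIL netlist syntax; the class ModuleInstance, the function nets_to_pins,
and the names blk1 through blk4, cin, and y1 through y4 are all hypothetical
and are used only to show the idea that identical modules can keep one internal
signal name while the parent assigns a different net name to each instance.

```python
# Hypothetical model of renaming a signal at a module boundary.
# This is NOT GENESIL syntax; all names here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class ModuleInstance:
    name: str
    # Maps the module's internal port name to the net it joins in the parent.
    port_to_net: dict = field(default_factory=dict)

    def connect(self, port: str, parent_net: str) -> None:
        """Attach an internal port to a (possibly differently named) parent net."""
        self.port_to_net[port] = parent_net


def nets_to_pins(instances):
    """Group (instance, port) pins by the parent-level net they attach to."""
    nets = {}
    for inst in instances:
        for port, net in inst.port_to_net.items():
            nets.setdefault(net, []).append((inst.name, port))
    return nets


# Four copies of the same module; each copy calls its output "x" internally.
blocks = [ModuleInstance(f"blk{i}") for i in range(1, 5)]
for i, blk in enumerate(blocks, start=1):
    blk.connect("x", f"y{i}")  # renamed per instance at the parent level

# blk2 also receives blk1's output on an input port named "cin" (illustrative).
blocks[1].connect("cin", "y1")

connectivity = nets_to_pins(blocks)
for net, pins in connectivity.items():
    print(net, pins)
```

Because the renaming happens in the parent, the four module descriptions stay
identical, which is the sense in which "customizing" each 4-bit multiplier is
avoided.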

Before starting the design of the pipelined version of the 8-bit
parallel multiplier array, a decision had to be made regarding what orientation to
implement. Based on size only, 8bmm.1 (Figure 58) would be favored because
it had the smallest area. However, due to the size (width) of the D flip-flops
required to pipeline the array, the orientation of 8bmm.5 (see Figure 66) was
selected. The decision to implement the orientation of 8bmm.5 was also based on
the inherent symmetry of the array, which would lend itself to simple horizontal
cuts for inserting the pipeline registers.

Figure 68 GENESIL Layout for 8BITMOD (8,993.1 mils²)
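
The size comparison behind this decision can be reproduced from the areas
reported for each version. The following is a minimal sketch using only those
reported values; the helper name percent_change is an assumption introduced
here for illustration.

```python
# Areas in square mils, as reported in the text for each 8-bit array version.
areas = {
    "8bmm.1": 8157.51,
    "8bmm.2": 8474.23,
    "8bmm.3": 8513.23,
    "8bmm.4": 8539.21,
    "8bmm.5": 8395.65,
    "8BITMOD": 8993.1,
}


def percent_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent (hypothetical helper)."""
    return 100.0 * (new - old) / old


baseline = areas["8bmm.1"]
relative = {name: percent_change(a, baseline) for name, a in areas.items()}
smallest = min(areas, key=areas.get)
largest = max(areas, key=areas.get)

for name, area in areas.items():
    print(f"{name:8s} {area:9.2f} mils^2  {relative[name]:+5.1f}% vs 8bmm.1")
print("smallest:", smallest, " largest:", largest)
```

On area alone the smallest layout is 8bmm.1, so the choice of 8bmm.5 rests on
the flip-flop width and symmetry arguments rather than on size.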

2.  8-Bit  Pipelined  Multiplier  Array 

The first step in designing the pipelined 8-bit multiplier array was to
inspect the timing analysis of 8bmm.5 to determine between which levels the
pipeline registers should be inserted. Based on the output delays of 8bmm.5
listed in Table 3, the array was divided into four pipelined stages. The product
out of the first stage (P2) was available after a 17.6 ns propagation delay and the
outputs from the other stages were nearly multiples of this delay.

TABLE 3 TIMING ANALYSIS FOR 8BMM.5

[Table 3 body not recovered; surviving entry: 61.1 ns]

Table 3 suggests inserting registers between products P2/P3, P5/P6, and
P9/P10, which will result in nearly equal delays for each stage. This corresponds
to inserting registers between levels 2/3, 5/6, and P9/P10 of Figures 55 and 63.
The insertion of registers between P9/P10 required a modification to the final
row of adders in level_8. This modification (8bmm.5A) is shown in Figure 69
below. It was necessary to split the original 7-bit adder of 8bmm.5 into a 5-bit
adder and the two separate adders that produce P8 and P9. A
timing analysis was conducted on 8bmm.5A and the results are shown in Figure
70 below.

Figure 69 Modification to Level_8 (8bmm.5A)

Figure 70 Timing Analysis for 8bmm.5A

The results show a 17.8 ns delay for stage 1 (levels 0-2), a 16.5 ns
delay for stage 2 (levels 3-5), a 16.4 ns delay for stage 3 (level_6 through P9), and
an 8.8 ns delay for stage 4, the final row of adders. This is summarized in Table
4 below.


TABLE 4 OUTPUT DELAYS FOR PIPELINED STAGES 1-4

STAGE    LEVELS     OUTPUT DELAY (ns)
1        0-2        17.8
2        3-5        16.5
3        6-P9       16.4
4        P10-P15    8.8
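
The balance of the four stages in Table 4 can be checked with a short
calculation. The sketch below is an idealization: it assumes the clock period
is set by the slowest stage alone and ignores register setup and
clock-to-output delays and any added routing, so the figures it produces are
upper bounds rather than measured results. The variable names are introduced
here for illustration.

```python
# Stage delays in ns, taken from Table 4.
stage_delay_ns = {1: 17.8, 2: 16.5, 3: 16.4, 4: 8.8}

# The slowest stage sets the minimum clock period of an ideal pipeline.
slowest_stage = max(stage_delay_ns, key=stage_delay_ns.get)
slowest_ns = stage_delay_ns[slowest_stage]
total_ns = sum(stage_delay_ns.values())

# Idealized figures: register and routing overhead are ignored (assumption).
ideal_clock_mhz = 1000.0 / slowest_ns
ideal_speedup = total_ns / slowest_ns

# Fraction of the clock period each stage actually uses.
stage_utilization = {s: d / slowest_ns for s, d in stage_delay_ns.items()}

print(f"slowest stage: {slowest_stage} ({slowest_ns} ns)")
print(f"ideal clock:   {ideal_clock_mhz:.1f} MHz")
print(f"ideal speedup: {ideal_speedup:.2f}x over the unpipelined sum")
for stage, used in stage_utilization.items():
    print(f"stage {stage}: {used:.0%} of the clock period")
```

Stages 1 through 3 are closely matched, while stage 4 uses only about half of
the period set by stage 1.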

Following the timing analysis, a CAD drawing depicting the pipelined 8-bit
multiplier array (8bmmPL) was made. Figure 71 shows the upper third and
Figure 72 shows the middle third of 8bmmPL. Figure 73 shows the lower third
of this array.

Figure 71 CAD Layout of 8bmmPL (Upper Third)

The basic signal naming scheme was modified, due to the presence of
pipelined stages, by use of an underline character to indicate signals which
passed through pipelined stages.

Figure 73 CAD Layout of 8bmmPL (Lower Third)

Note in Figure 73 how the first two adders are separated from the final
row of adders in level_9. This resulted from splitting the original 7-bit
adder in order to pipeline in four stages. The floorplan for the array is shown in
Figure 74 and the GENESIL layout is shown in Figure 75. One can clearly see
the individual levels and pipeline registers. However, one can also see unused
space between the first two stages to the left and right of the array, as well as
the two adders which produce P8 and P9 and the empty space surrounding
them. Yet, overall, the structure clearly shows the logic flow of the array and
demonstrates the physical concept of pipelining.

Following the functional verification of 8bmmPL, a timing analysis was
conducted to determine the worst-case paths. The results are shown in Figure 76.
The worst-case path delay was determined to be 26.7 ns, which corresponds to a
clock rate of approximately 37.45 MHz.

Figure 76 Worst Case Path for 8bmmPL
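
The clock rate quoted for 8bmmPL follows from taking the reciprocal of the
worst-case path delay. A minimal sketch of that conversion is given below; the
function name clock_rate_mhz is an assumption introduced here, and the only
input is the 26.7 ns figure reported in the text.

```python
def clock_rate_mhz(period_ns: float) -> float:
    """Maximum clock frequency in MHz for a worst-case path delay given in ns."""
    if period_ns <= 0:
        raise ValueError("period must be positive")
    # 1 / (period_ns * 1e-9 s) expressed in MHz reduces to 1000 / period_ns.
    return 1000.0 / period_ns


worst_path_ns = 26.7  # worst-case path reported for 8bmmPL
rate = clock_rate_mhz(worst_path_ns)
print(f"{rate:.2f} MHz")
```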

Finally, 8bmmPL was incorporated into a multiplier Chip
(8bmulti_chip), which resulted in a total area of 44,488.41 mils². Note that the
Chip Module (8bmulti_chip) is approximately 222% greater in total area than
8bmmPL. Figure 77 shows the GENESIL layout for 8bmulti_chip.

Figure 77 GENESIL Layout for 8bmulti_chip (44,488.41 mils²)

3.  16-Bit  Pipelined  Multiplier  Array 

A 16-bit pipelined multiplier array, incorporating parallel multiplier
cells, was not implemented in this study; however, from Figures 75 and 77
its core size (without PADS) was projected to be 99,328 mils² (256
x 388), while its Chip size was estimated at 140,800 mils² (320 x 440). Figure 78
shows a Block level layout for this multiplier. Its operating speed was estimated
at 38 MHz, approximately the same as 8bmmPL.

Figure 78 Block Level Layout of a 16-Bit Pipelined Multiplier Array
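
The projected areas can be checked directly from the quoted dimensions. The
sketch below assumes that the two dimension pairs are in mils, which is
consistent with the products being stated in mils²; the variable names are
introduced here for illustration, and the pad overhead figure is derived from
the estimates rather than reported in the text.

```python
# Projected 16-bit dimensions, assumed to be in mils (width x height).
core_dims = (256, 388)  # core without PADS
chip_dims = (320, 440)  # full Chip

core_area = core_dims[0] * core_dims[1]
chip_area = chip_dims[0] * chip_dims[1]

# Extra area of the Chip relative to its core, in percent (derived value).
pad_overhead_pct = 100.0 * (chip_area - core_area) / core_area

print(f"core: {core_area:,} mils^2")
print(f"chip: {chip_area:,} mils^2")
print(f"chip exceeds core by {pad_overhead_pct:.1f}%")
```

Under these estimates the relative overhead of the pads and surrounding
routing is much smaller for the 16-bit projection than the roughly 222% stated
for 8bmulti_chip, since the core grows faster than the pad ring.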

VI. LIMITATIONS OF THE SILICON COMPILER


It was a goal of this thesis to fully explore and probe the GENESIL Silicon
Compiler system in order to determine its practical limits in parallel multiplier
array design. During the course of this study, two apparent limitations of the GSC
system in parallel multiplier array design were discovered. They are:

•  Component  density. 

•  Vertical  feedthrough. 

The  most  significant  limitation  of  the  GSC  system  appears  to  be  its  inability 
to  achieve  high  component  density  in  parallel  multiplier  arrays  of  the  type 
implemented  in  Chapter  5.  Here,  component  density  refers  to  the  relative 
distance  between  levels  of  a  parallel  multiplier  array,  as  well  as  between 
individual  components  comprising  the  array.  It  appears  that  high  density  is 
precluded  because  of  the  abutting  of  the  power  buses  Vdd  and  Vss  of  the 
individual  components  of  the  array.  Figure  79  shows  this  abutment  between 
adjacent  components.  Higher  density  might  be  achieved  if  the  power  buses  of 
adjacent  components  were  permitted  to  overlap.  Additionally,  the  relative  size 
(width)  of  the  power  buses  appears  to  be  a  factor  contributing  to  the  separation 
between  components. 

The second limitation of the GSC appears to be its inability to establish
vertical feedthrough between adjacent levels of ADDER/AND components in the
parallel multiplier arrays in this study. As stated earlier, an attempt was made to
increase the density of the arrays by collapsing the array vertically: the AND
gate was moved to the top of the ADDER, and the two blocks were then rotated
clockwise 90°. After rotating the two blocks, a feedthrough Block was attached to
each AND gate. This proved unsuccessful in passing the xi signal from the AND
gate of the upper level to the AND gate in the level below. Figure 80 shows just
one of several attempts to establish vertical feedthrough.

Although the GSC system did not perform as desired in this study, it offers a
viable alternative to the labor-intensive, full custom, VLSI graphic layout tools in
use today.

Figure 80 Vertical Feedthrough

VII. CONCLUSIONS


A.  SUMMARY 

The  main  goal  of  this  thesis  was  to  describe  the  design  methodology  and  the 
process  of  employing  the  GENESIL  Silicon  Compiler  (V7.1)  in  the  layout  of  a 
pipelined  multiplier,  in  1.5  micron  CMOS  technology,  using  a  parallel  multiplier 
cell  array.  There  was  an  additional  goal  of  determining  the  practical  limits  of  the 
GSC  in  parallel  multiplier  array  design.  Finally,  there  was  the  intent  to  produce 
a  document  with  sufficient  background  material  for  those  readers  not  well  versed 
in  digital  design  methodology  in  order  that  they  might  gain  some  understanding 
of  the  methods  involved  in  the  design  of  a  pipelined  parallel  multiplier  array. 

The  material  in  Chapter  2  provided  a  brief  introduction  to  one  particular 
silicon  compiler,  namely  the  GENESIL  Silicon  Compiler  (GSC).  Chapter  3 
provided  a  review  of  the  basic  principles  of  digital  multipliers,  while  Chapter  4 
covered  the  basic  concept  and  theory  of  pipelining.  The  design  iterations  of 
several  pipelined  parallel  multiplier  arrays,  incorporating  parallel  multiplier 
cells,  were  presented  in  Chapter  5.  Comments  regarding  the  practical  limits  of 
the  GSC  system  when  implementing  the  parallel  multiplier  array  designs  of  this 
study  were  presented  in  Chapter  6. 

The  results  of  this  thesis  indicate  that  a  parallel  multiplier  array, 
incorporating  parallel  multiplier  cells,  can  be  successfully  implemented  in  the 
GSC  system.  However,  two  practical  limits  of  the  GSC  system  precluded 
achieving  the  degree  of  high  component  density  (smaller  size)  made  possible  by 
full  custom  manual/CAD  design  methods  using  graphic  layout  tools. 

B. RECOMMENDATIONS

The  author  makes  the  following  recommendations: 

•  Install  version  8.0  of  the  GENESIL  Silicon  Compiler  at  the  Naval 
Postgraduate  School  as  soon  as  possible. 

•  Explore  version  8.0  fully  to  determine  its  capability  to  establish 
vertical  feedthrough.  If  successful,  incorporate  this  feature  into  future 
parallel  multiplier  array  designs  for  comparison  with  full  custom 
manual/CAD  designs  using  graphic  layout  tools. 

•  Investigate  ways  to  reduce  the  CPU  loading  on  the  VAX  system  during 
normal  working  hours  in  order  to  enhance  the  performance  of  the  GSC 
system. 

•  Allow 3-4 months for learning to use the GSC. Preferably, one should also
attend the one-week training course offered by Silicon Compiler Systems
Corporation of San Jose, California.

•  Incorporate  the  GSC  system  into,  and  make  it  a  regular  part  of,  a  course  of 
instruction  at  the  Naval  Postgraduate  School. 